Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions #13876

Closed
wants to merge 12 commits into from
Closed

[SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions #13876

wants to merge 12 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jun 23, 2016

What changes were proposed in this pull request?

This PR improves OptimizeIn optimizer to remove the literal repetitions from SQL IN predicates. This optimizer prevents user mistakes and also can optimize some queries like TPCDS-36.

Before

scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN,TN,TN,TN,TN,TN,TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]

After

scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]

How was this patch tested?

Pass the Jenkins tests (including a new testcase).

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61126 has finished for PR 13876 at commit e33b7f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -793,6 +794,20 @@ object ConstantFolding extends Rule[LogicalPlan] {
}

/**
* Removes literal repetitions from IN predicate
*/
object RemoveLiteralRepetitionFromIn extends Rule[LogicalPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this just go into OptimzieIn

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also does this need to be literal specific

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review, @rxin .

  1. Sure, I can merge this into OptimizeIn.
  2. Also, it can be used for deterministic expressions.
    I'm just here focus on literals. May I handle both cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea why don't we handle both

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! No problem.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16174][SQL] Add RemoveLiteralRepetitionFromIn optimizer [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions Jun 23, 2016
@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61136 has finished for PR 13876 at commit e0239a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61135 has finished for PR 13876 at commit 6180daf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61137 has finished for PR 13876 at commit 5a9f4ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this PR again when you have some time?

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Now, variable l is replaced with list.

@SparkQA
Copy link

SparkQA commented Jun 25, 2016

Test build #61232 has finished for PR 13876 at commit cf7b869.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
For this OptimizeIn PR, please let me know if we need further optimization.
Thank you always.

@SparkQA
Copy link

SparkQA commented Jun 27, 2016

Test build #61289 has finished for PR 13876 at commit 61552f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this OptimizeIn PR?

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Do you want me to split this OptimizeIn into another file, too?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions Jun 28, 2016
@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61563 has finished for PR 13876 at commit 53363e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jun 30, 2016

Hi, @rxin .
Could you review this OptimizeIn again when you have some time?

@dongjoon-hyun
Copy link
Member Author

Rebased to the master.

@SparkQA
Copy link

SparkQA commented Jul 2, 2016

Test build #61662 has finished for PR 13876 at commit 63b3ecd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 5, 2016

Test build #61760 has finished for PR 13876 at commit 6628c7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this PR again?

val hSet = list.map(e => e.eval(EmptyRow))
InSet(v, HashSet() ++ hSet)
case i @ In(v, list) =>
val (deterministics, others) = list.partition(_.deterministic)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question i have is how often do we see an in expression with some expressions being deterministic and some nondeterministic? if not, i'd just simplify this so we only do it if everything is deterministic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. In real situation, case i @ In(v, list) if list.forall(_.deterministic) will cover the most cases.

I'll update like that. Thank you for review again!

@SparkQA
Copy link

SparkQA commented Jul 6, 2016

Test build #61860 has finished for PR 13876 at commit 314bc74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Now, it's simplified for the all deterministic cases and passed the Jenkins again.
Thank you for advice.

} else if (newList.length < list.length) {
i.copy(v, newList)
} else { // newList.length == list.length
i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can bring some performance concerns because we are doing a lot of work in order to return the original query, and given the optimizer is iterative, it would spend a lot of cycles just doing this.

Can we introduce a flag (lazy val) to the In expression to check whether it is optimizable? If it is not, then we shouldn't even go into the case. Something like

case class In(...) {
  lazy val inSetConvertable: Boolean = list.forall(_.deterministic)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. That sounds great. I'll fix soon.

@@ -132,6 +132,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate
}

override def children: Seq[Expression] = value +: list
lazy val inSetConvertible = children.forall(_.deterministic)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad - we should put newList.forall(_.isInstanceOf[Literal]) here too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark the type explicitly since this is a public funciton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then, the semantic is different. What you mean is just improving InSet.
My original PR was about for deletion about all deterministic duplications.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, if that is your intention, Okay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah.. We need to update all PR/JIRA description, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61875 has finished for PR 13876 at commit 23d6e30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions Jul 7, 2016
@dongjoon-hyun
Copy link
Member Author

Now, the scope of PR is reduced a lot. But, I hope this PR still covers majority of real queries.
Thank you for many advice.

*/
case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) &&
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, here is regression. Originally, v could be non-deterministic.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61878 has finished for PR 13876 at commit 125036a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61887 has finished for PR 13876 at commit 63a4a79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61888 has finished for PR 13876 at commit ccf972d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val hSet = newList.map(e => e.eval(EmptyRow))
InSet(v, HashSet() ++ hSet)
} else if (newList.size < list.size) {
expr.copy(value = v, list = newList)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to copy value here, do you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's a whole value. We had better create the expression In.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry what i meant was ... we only need to do

expr.copy(list = newList)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. My bad!

@rxin
Copy link
Contributor

rxin commented Jul 7, 2016

Looks pretty good.

cc @cloud-fan for another look.

@cloud-fan
Copy link
Contributor

LGTM

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @cloud-fan .

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61898 has finished for PR 13876 at commit eead3db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

There is one failure in HiveSparkSubmitSuite. It seems to be irrelevant. I'll retry the test.

- SPARK-8020: set sql conf in spark conf *** FAILED *** (31 seconds, 981 milliseconds)

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61897 has finished for PR 13876 at commit f068e4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61901 has finished for PR 13876 at commit eead3db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

At this time, it passed as expected.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in a04cab8 Jul 7, 2016
@dongjoon-hyun
Copy link
Member Author

Thank you for review and merging, @cloud-fan and @rxin .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16174 branch July 20, 2016 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants